AMD Configuration

AMD GPU PyTorch Setup

Install the AMD GPU driver
sudo apt update
sudo apt install wget gnupg2
# Check the OS version
lsb_release -a
# Install the driver
wget https://repo.radeon.com/amdgpu-install/5.4.2/ubuntu/jammy/amdgpu-install_5.4.50402-1_all.deb
sudo apt-get install ./amdgpu-install_5.4.50402-1_all.deb
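# Then install the required use cases
sudo amdgpu-install --usecase=rocm,hip,mllib --no-dkms
# Add the current user to the video and render groups
sudo usermod -a -G video,render $LOGNAME
# Install the precompiled MIOpen kernels to reduce startup time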
sudo apt-cache search miopenkernels-gfx908
sudo apt-get install miopenkernels-gfx908-120kdb

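Configure the environment

# One extra environment variable is required (it overrides the GPU architecture version seen by ROCm)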
export HSA_OVERRIDE_GFX_VERSION=9.0.8
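
The override value corresponds to a gfx target name. A minimal Python sketch of that mapping, assuming the three fields are major, minor, and stepping, and that minor and stepping appear as single hex digits in the target name (so 9.0.8 matches the gfx908 in the miopenkernels package name above). gfx_target_from_override is a hypothetical helper, not part of ROCm:

```python
import os


def gfx_target_from_override(value: str) -> str:
    """Convert 'major.minor.stepping' (e.g. '9.0.8') into a gfx target name.

    Assumption: minor and stepping are rendered as single hex digits,
    which is how names such as gfx90a (9.0.10) are formed.
    """
    major, minor, stepping = (int(part) for part in value.split("."))
    return f"gfx{major}{minor:x}{stepping:x}"


if __name__ == "__main__":
    override = os.environ.get("HSA_OVERRIDE_GFX_VERSION")
    if override is None:
        print("HSA_OVERRIDE_GFX_VERSION is not set in this shell")
    else:
        print(override, "->", gfx_target_from_override(override))
```

The export only applies to the current shell; appending the line to ~/.bashrc is one way to keep it across logins.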

Reboot, then verify

reboot
# Show GPU information
rocm-smi

# Both commands should list the GPU
/opt/rocm/bin/rocminfo
/opt/rocm/opencl/bin/clinfo
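
The usermod change only takes effect in new login sessions, which is one reason the reboot matters. A minimal Python sketch to confirm that the current session really has the video and render groups. missing_groups and current_group_names are hypothetical helper names; only the standard-library grp and os modules are used:

```python
import grp
import os


def missing_groups(required, current):
    """Return the required group names that are absent from `current`, sorted."""
    return sorted(set(required) - set(current))


def current_group_names():
    """Names of the groups the running process belongs to."""
    names = []
    for gid in set(os.getgroups()) | {os.getegid()}:
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # gid without an entry in the group database; skip it
            continue
    return names


if __name__ == "__main__":
    missing = missing_groups(["video", "render"], current_group_names())
    print("missing groups:", missing if missing else "none")
```

If a group is reported missing, rocminfo may fail with a permission error even though the driver itself is installed correctly.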
Set up torch
conda create -n py39 python=3.9
conda activate py39
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.2

Test

import torch
torch.cuda.is_available()
# Returns True if PyTorch can use the GPU
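
A slightly fuller check can help tell a ROCm build apart from a CPU-only or CUDA build that was installed by mistake. This is a sketch, not an official diagnostic: torch_gpu_report is a hypothetical helper, and it degrades to a short report when torch is not installed.

```python
import importlib.util


def torch_gpu_report():
    """Collect basic facts about the installed PyTorch build, if any."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # torch.version.hip is a version string on ROCm builds and None otherwise
        "hip_version": getattr(torch.version, "hip", None),
        "gpu_available": torch.cuda.is_available(),
    }
    if report["gpu_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in torch_gpu_report().items():
        print(f"{key}: {value}")
```

On a working ROCm install, hip_version should be a version string and gpu_available should be True.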

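Speed test

import time

import torch

# On ROCm builds of PyTorch the GPU is addressed through the 'cuda' device name
x = torch.randn([1024, 1024], device=torch.device('cuda:0'))

# First GPU matmul: includes one-time initialization
start = time.time()
r = x @ x
end = time.time()
print(end - start)

x2 = x.cpu()

# 10 matmuls on the CPU
start = time.time()
for i in range(10):
    r2 = x2 @ x2
end = time.time()
print(end - start)

x3 = x2.to(torch.device('cuda:0'))

# 10 matmuls on the GPU after initialization.
# Note: GPU kernels launch asynchronously, so without torch.cuda.synchronize()
# before reading the clock this mostly measures launch overhead.
start = time.time()
for i in range(10):
    r3 = x3 @ x3
end = time.time()
print(end - start)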

The first GPU call appears to include one-time initialization and is slow, taking about 3.0 s.

The second GPU run takes about 0.0002 s.
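
Because GPU work is queued asynchronously, a timing loop gives more trustworthy numbers when it warms up first and synchronizes before each clock read. A minimal sketch of such a helper follows; time_call and its parameters are hypothetical names, and the demonstration runs on the CPU so it works without a GPU.

```python
import time


def time_call(fn, repeats=10, warmup=1, sync=None):
    """Return the average wall-clock seconds per call of fn().

    sync, if given, is called before each clock read so that queued
    asynchronous work (e.g. GPU kernels) is included in the measurement.
    """
    for _ in range(warmup):
        fn()  # absorbs one-time initialization cost
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    # CPU-only demonstration so the sketch runs without a GPU
    print(time_call(lambda: sum(range(100_000))))
```

With the tensors from the speed test, the GPU case would be time_call(lambda: x3 @ x3, sync=torch.cuda.synchronize).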

Docker setup

The pip-installed build has a slow backward pass, and the apex and deepspeed libraries are easier to install inside Docker, so Docker is used instead.

docker pull rocm/pytorch:latest
docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G --name amd_torch_jiangwei -p 6606:22 --mount type=bind,source=/ganzhi/ssd/data,target=/data6 rocm/pytorch:latest
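
The two --device flags pass the host's /dev/kfd (the ROCm compute interface) and /dev/dri (the render nodes) into the container, so both must already exist on the host. A small Python sketch to check this before starting the container; missing_device_nodes is a hypothetical helper:

```python
import os


def missing_device_nodes(paths=("/dev/kfd", "/dev/dri")):
    """Return the device paths from `paths` that do not exist on this host."""
    return [path for path in paths if not os.path.exists(path)]


if __name__ == "__main__":
    missing = missing_device_nodes()
    if missing:
        print("missing on host:", ", ".join(missing))
    else:
        print("/dev/kfd and /dev/dri are present")
```

A missing /dev/kfd usually points at the host driver rather than at Docker.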
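Configure SSH login (optional)

For the configuration, see the docker/docker SSH remote connection post.

Start the container again, this time running sshd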

docker run -d --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 32G --name torch20_ssh -p 6606:22 --mount type=bind,source=/ganzhi/ssd/data,target=/data6 pytorch2-0:ssh /usr/sbin/sshd -D

Install torch 2.0 dependencies (optional)

First add the following lines to /etc/apt/sources.list

deb http://apt.llvm.org/focal/ llvm-toolchain-focal-13 main
deb-src http://apt.llvm.org/focal/ llvm-toolchain-focal-13 main

Then run the following shell commands

sudo apt update

sudo apt install clang-13

Fix the torch.compile error

SystemError: <built-in function load_binary> returned NULL without setting an exception
export ROCM_PATH=/opt/rocm-5.4.2
rm -rf /tmp/*
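
ROCM_PATH has to point at a directory that actually exists; the 5.4.2 in the path matches the amdgpu-install version used earlier and will differ for other releases. A small Python sketch to list candidate directories, assuming ROCm lives under /opt as in the export above; find_rocm_dirs is a hypothetical helper:

```python
import glob
import os


def find_rocm_dirs(root="/opt"):
    """Return versioned ROCm directories under root (e.g. /opt/rocm-5.4.2), sorted."""
    candidates = glob.glob(os.path.join(root, "rocm-*"))
    return sorted(path for path in candidates if os.path.isdir(path))


if __name__ == "__main__":
    print("ROCM_PATH =", os.environ.get("ROCM_PATH"))
    print("installed:", find_rocm_dirs())
```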

Fix the missing cmath header

sudo apt install libstdc++-12-dev

Manually install torch and other ROCm-related packages

Install the dependencies (required for installing apex)

sudo apt install rocm-dkms rocm-dev rocm-libs miopen-hip miopengemm hipsparse rccl rocthrust hipcub roctracer-dev

Installing rocm-dkms also pulls in rocm-clang.

Install torchaudio

git clone https://github.com/pytorch/audio.git
cd audio
python setup.py install

Resolving version dependencies

To pin a specific version, for example:

sudo apt install rocm-dev5.2.4
sudo amdgpu-install --usecase=rocm,hip,mllib --no-dkms --rocmrelease=5.2.4

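It is best not to pin the version manually; the amdgpu-install package already carries the version information.

For the dependency error "dpkg failed to overwrite ….", purge the conflicting package: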

sudo dpkg -P xxx

TensorFlow installation

# Uninstall the old version
sudo amdgpu-uninstall
sudo apt autoremove
# Install the new version
wget https://repo.radeon.com/amdgpu-install/6.3.2/ubuntu/jammy/amdgpu-install_6.3.60302-1_all.deb
sudo apt-get install ./amdgpu-install_6.3.60302-1_all.deb
# Then install the required use cases
sudo amdgpu-install --usecase=rocm,hip,mllib --no-dkms

pip3 install https://repo.radeon.com/rocm/manylinux/rocm-rel-6.3.2/tensorflow_rocm-2.15.1-cp310-cp310-manylinux_2_28_x86_64.whl
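
The wheel file name encodes the Python version it was built for: cp310 means CPython 3.10, and pip will refuse the wheel under any other interpreter. A minimal Python sketch to compare the wheel's tag with the running interpreter; wheel_python_tag and interpreter_tag are hypothetical helpers that rely on the standard wheel naming scheme:

```python
import sys


def wheel_python_tag(filename: str) -> str:
    """Extract the Python tag (e.g. 'cp310') from a wheel file name.

    Wheel names end in '-{python tag}-{abi tag}-{platform tag}.whl'.
    """
    stem = filename.rsplit("/", 1)[-1]
    if not stem.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename}")
    return stem[: -len(".whl")].split("-")[-3]


def interpreter_tag() -> str:
    """CPython tag of the running interpreter, e.g. 'cp310' for Python 3.10."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


if __name__ == "__main__":
    url = ("https://repo.radeon.com/rocm/manylinux/rocm-rel-6.3.2/"
           "tensorflow_rocm-2.15.1-cp310-cp310-manylinux_2_28_x86_64.whl")
    print("wheel:", wheel_python_tag(url), "interpreter:", interpreter_tag())
```

If the two tags differ, either pick the wheel that matches or create an environment with the matching Python version.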

Installing via Docker

docker pull rocm/tensorflow:latest

docker run -it --network=host --device=/dev/kfd --device=/dev/dri \
--ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined rocm/tensorflow:latest

Both methods fail with the error tensorflow.python.framework.errors_impl.UnknownError: Failed to query available memory for GPU 0

Reinstall following the official documentation

Official documentation: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html

Install ROCm

wget https://repo.radeon.com/amdgpu-install/6.4/ubuntu/jammy/amdgpu-install_6.4.60400-1_all.deb
sudo apt install ./amdgpu-install_6.4.60400-1_all.deb
sudo apt update
sudo apt install python3-setuptools python3-wheel
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
sudo apt install rocm

Following the official guide, hip and mllib do not need to be installed, which saves about half the storage space.

Install DKMS

sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
sudo apt install amdgpu-dkms

Install TensorFlow

pip install tensorflow-rocm==2.18.1 -f https://repo.radeon.com/rocm/manylinux/rocm-rel-6.4/ --upgrade

Reboot

reboot

This method tested successfully. The likely reason is either that dkms was installed or that hip was not installed.
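
A quick way to check the result after the reboot is to ask TensorFlow which GPUs it can see. The following is a sketch only: tensorflow_gpus is a hypothetical helper, and it returns None when TensorFlow is not installed in the current environment.

```python
import importlib.util


def tensorflow_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is absent."""
    if importlib.util.find_spec("tensorflow") is None:
        return None

    import tensorflow as tf

    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = tensorflow_gpus()
    if gpus is None:
        print("tensorflow is not installed in this environment")
    else:
        print("GPUs visible to TensorFlow:", gpus)
```

An empty list means TensorFlow imported but found no usable GPU.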
